Every year more than a million people lose their lives on roadways[1]. These unfortunate events happen all over the world, and if they can be avoided, why shouldn't they be? That is the exact purpose of this project: gaining insight into how future deaths and serious injuries in traffic can be prevented. This will be done by investigating the following:
The data set used to investigate this is provided by the French government and contains information about traffic accidents in France from 2016-2018 (click here for data source). If we can find trends in the data set about France, who is to say that these insights cannot be useful for other countries as well?
We've set the stage, now let's get started.
To start off easy and briefly get to know the data, a quick overview of the accidents in France is presented below. This covers accidents per day, victims killed per year and much more. Check it out!

Let's investigate when the accidents most frequently occur. How has the road safety developed over the years considered? Is there seasonality in the number of accidents?
The number of accidents on a given day is plotted for the entire time horizon from 2016 to 2018. It can be seen that, in general, it has become increasingly dangerous to be a road user during the considered time horizon: the majority of the observations in 2018 are above the mean number of accidents per day for the entire time horizon.
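If you want to check a claim like this yourself, here is a minimal sketch of the idea: count accidents per day, compute the mean over the whole period, and see what share of a given year's days lie above it. The function names and the toy dates are made up for illustration and are not taken from the French data set.

```python
from collections import Counter
from datetime import date

def daily_counts(accident_dates):
    """Count accidents per calendar day (one date per accident)."""
    return Counter(accident_dates)

def share_of_days_above_mean(counts, year):
    """Share of a year's observed days whose count exceeds the overall daily mean.

    Note: days without any accident do not appear in `counts` and are ignored.
    """
    overall_mean = sum(counts.values()) / len(counts)
    year_counts = [n for day, n in counts.items() if day.year == year]
    return sum(n > overall_mean for n in year_counts) / len(year_counts)

# Toy data (hypothetical): two days in 2016 and two days in 2018.
toy_dates = (
    [date(2016, 1, 1)] * 2 + [date(2016, 1, 2)] * 1
    + [date(2018, 1, 1)] * 4 + [date(2018, 1, 2)] * 5
)
counts = daily_counts(toy_dates)
share_2016 = share_of_days_above_mean(counts, 2016)
share_2018 = share_of_days_above_mean(counts, 2018)
```

In the toy data every 2018 day lies above the overall mean and no 2016 day does, which is the kind of pattern described above, only far more extreme.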
Did you also notice some seasonality in the number of accidents? Let's zoom in and have a closer look.
First thing to notice is that the number of accidents per day in 2016 is not as volatile as in 2018.
For each year the number of accidents peaks around the beginning of July (approx. day 180) and then decreases until mid-August (approx. day 220), where the number of reported accidents starts to increase again. Is this expected behaviour? Yes, this could be associated with the summer holiday resulting in more vehicles on the road, from both residents and tourists. Also, the weather tends to be nice during summer, hence more people are outdoors getting from A to B.
Note that the number of reported accidents is lower in the winter months December and January than in the remaining months. This is surprising, as one could imagine that bad weather in the winter months might affect road safety. Additionally, there could be more personal vehicles on the roads, as other transportation alternatives (biking, walking and public transportation) are less appealing in cold weather.
Let's zoom in even further and see how an average week looks hour by hour.
It can be seen that on weekdays most accidents generally occur during the rush hours, i.e. in the morning from 8-10 AM and in the afternoon from 4-6 PM.
Accidents are less frequent on the weekend. This might be explained by the fact that there is less traffic on weekends.
What is the most dangerous hour of the week to be out in traffic? Well, the hour of the week with the most traffic accidents is Friday afternoon at 5 PM. Hence people rushing home from work on Friday afternoons make it dangerous to be a road user.
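Finding the busiest hour of the week boils down to counting accidents per (weekday, hour) pair. The sketch below shows one way this could be done; the helper name and the three toy timestamps are hypothetical and only serve as an example.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def busiest_hour_of_week(timestamps):
    """Return (weekday name, hour, count) for the most frequent weekday-hour pair."""
    counts = Counter((ts.weekday(), ts.hour) for ts in timestamps)
    (weekday, hour), n = counts.most_common(1)[0]
    return DAY_NAMES[weekday], hour, n

# Toy timestamps (hypothetical): two Friday accidents around 5 PM, one on a Monday morning.
toy_timestamps = [
    datetime(2018, 6, 1, 17, 5),   # Friday
    datetime(2018, 6, 8, 17, 40),  # Friday
    datetime(2018, 6, 4, 8, 15),   # Monday
]
busiest = busiest_hour_of_week(toy_timestamps)
```

Keep in mind that this counts accidents, not risk per road user: an hour with many accidents may simply be an hour with a lot of traffic.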
Now it is time to get to know the victims. Let's start off easy by considering their gender and age. The bar chart below shows how many female or male victims of a certain age are killed in traffic accidents. If you hover over the bars, the frequency for a gender-age group will appear. Does what you see surprise you?
From this we see that far more males than females are killed in traffic. For all age groups except people under 20, the number of killed male victims exceeds the number of killed female victims. According to [2], males tend to drive more kilometers than females, which could be one explanation for the difference. Of course, not all of these people have to be drivers of cars, hence this is only a partial explanation.
Young people in the age group 20-29 years are more often killed in traffic accidents than any other age group - this holds for both genders. For males in particular, more than 2000 of the victims killed in traffic accidents are between 20 and 29 years old.
One surprising tendency is that the number of female victims is relatively stable for females above the age of 30, whereas it is decreasing for males. Why is this so, you ask? Hard to say exactly, but one possibility is that the behaviour of males in traffic changes with age. For instance, they may spend less time in traffic as they get older or become more careful. Unfortunately, this is not something we can tell from the data.
Now that we know a bit more about the age and gender of the victims, let's look at the details of their involvement in the accidents as well as their trip purposes over the course of the day. For instance, from the perspective of a car driver, at what hour of the day do most traffic accidents occur? How does this differ for pedestrians? Besides this, it is also possible to gain insight into the reason for being in traffic in the first place. Are the victims on their way to work or school when the accidents happen?
Below you are free to explore this and much more. On the left you can tick off different hourly distributions and they will appear in the plot. Have fun :)
For drivers, most accidents happen during rush hour (7-9 and 16-18). This is not surprising, as most people are going from A to B in this time interval - many of them either to/from school or work. It is not unlikely that more people in traffic increases the likelihood of being in an accident.
The category that really stands out is the trip purpose "professional use", which is more or less constant throughout the day, hence no peaks during rush hour. Maybe people on the job are not in such a rush, or perhaps the lower number of road users during working hours has an impact.
Are you still wondering if there is an external common denominator for all these victims? Let us see what the different surrounding conditions at the time of the accidents can tell. This includes atmospheric, illumination, surface and road conditions as well as road type. This is interesting, as it might reveal some new patterns or trends in the data about the accidents. One example could be whether the number of accidents increases as the weather changes. Let's take a look at the data, shall we?
From the plots above we see that the vast majority of accidents happen in broad daylight with normal conditions across the features! This was surprising to us, since we expected more accidents in general when the weather or lighting was unfavourable. It should be noted that a lot more cars are in traffic during the day than at night and that normal weather conditions are more common than rain or snow in France. Therefore, it seems reasonable that the distribution is skewed such that accidents appear more likely to happen during the day, even though this might very well not be the case.
From the plot in the lower right corner, we also see where the accidents occur in terms of road type. Few accidents occur on highways compared to other road types. Highways are associated with high speed, so it might seem counterintuitive that they should be safe. However, the road directions are separated by crash barriers and there are no vulnerable road users or left and right turns - which might explain why. Communal and departmental roads - similar to country roads with 2 lanes (1 in each direction) - are by far where most of the accidents happen.
Before we take another look and dive a bit deeper into these patterns (by considering the frequency of an accident with a given degree of severity) we will first consider the spatial distribution of the accidents.
Let's get into the spatial patterns in the data. Here, we will investigate the spatial dimension of the data, which could reveal some interesting relations. To get a basic idea of the spatial distribution we will consider the following choropleth map. It shows the log2 of the total number of accidents for each department in France.
Notice that it is possible to hover the cursor over the various departments of France, which will reveal the department number as well as the log2 value and the total number of accidents that have occurred in the given department.
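Why log2? A few departments have far more accidents than the rest, and on a linear colour scale they would drown out everything else. As a minimal sketch (the helper name and the toy department codes are only for illustration), the values behind such a map could be computed like this:

```python
import math
from collections import Counter

def log2_counts_by_department(department_codes):
    """Map each department code to (total accidents, log2 of that total)."""
    counts = Counter(department_codes)
    return {dep: (n, math.log2(n)) for dep, n in counts.items()}

# Toy data (hypothetical): one department code per accident.
toy_codes = ["75"] * 8 + ["13"] * 4 + ["48"] * 1
summary = log2_counts_by_department(toy_codes)
```

Each step of 1 on the log2 scale corresponds to a doubling of the number of accidents, so in the toy data department "75" sits 3 steps above department "48".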
With this log2 representation of the data, we notice that more accidents seem to happen close to major cities and fewer in the middle of France. This pattern is expected, since it most likely reflects the population density rather than any hidden or dangerous trends. This can be confirmed by checking out this cool site made by Map of France. Thus, we cannot immediately see any unexpected spatial trends related to the number of accidents. Instead, let us examine if there are any trends for the different degrees of severity of the accident. Perhaps this will reveal whether people drive more recklessly in some parts of France?
This map shows the accidents as well, but in much more detail than before. Here the points are coloured according to the severity of the accident they represent. The light green points are accidents in which the victim was unscathed, orange means light injury, red means hospitalized wounded and black marks victims that were killed. It should also be noted that the data shown is 4000 points picked at random, to represent the data without visualising too many points. To get an even better understanding of each individual accident, it is possible to click on the different points to get a list of characteristics describing the accident, victim and surroundings.
In general, what we see from the map is that accidents tend to cluster around the larger cities of France and less so where the population density decreases in the countryside. Moreover, it does seem as if light or mild (orange or green) accidents tend to cluster around major cities, while the more serious accidents (red and black) happen close to but outside of the big cities.
To get detailed accident descriptions for each severity category individually, check out these awesome maps.
In this section we will try to investigate what factors matter when determining the severity of an accident. Is it purely external circumstances such as weather, lighting or road conditions that affect the outcome of an accident, or do factors such as age and gender actually have a greater influence?
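Picking 4000 points at random is a one-liner, but it is worth making reproducible so the map looks the same on every run. Here is a small sketch; the function name and the fixed seed are our own assumptions for the example.

```python
import random

def sample_for_map(records, k=4000, seed=0):
    """Draw k records without replacement; return everything if there are at most k."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the map reproducible
    if len(records) <= k:
        return list(records)
    return rng.sample(records, k)

# Toy data (hypothetical): 10000 placeholder accident ids.
sampled = sample_for_map(list(range(10000)))
```

Because the sample is drawn uniformly, dense areas stay dense on the map, while rare outcomes such as fatal accidents will only show up a handful of times.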
But first let's have a look at how likely it is that an accident is severe.
From this it can be seen that most likely you would leave an accident unharmed or with only light injuries. That's pretty cool!
So why even bother digging into the "few" serious accidents that do occur? Every death is a tragedy. Do you really want to leave it up to chance?
In order to adjust for this unbalanced frequency between the severity categories, relative frequency plots will be used. On these plots the heights can directly be interpreted as probabilities. Pretty cool, right?
Factors such as our age or gender are out of our control - but would it be safer to send your spouse or children shopping instead of going yourself? Fortunately, there are still things we can control, like whether we get behind the wheel or instead take the passenger seat, and whether or not we use a seatbelt. But does it really make a difference when it comes to the severity of an accident? These are questions that we will investigate in this section.
Do age and gender affect the severity?
As it turns out, after the age of 40 the likelihood of an accident being serious increases with age, independent of gender. This could be related to the fact that elderly people are more fragile. An increase in severe accidents is also observed for the age group of 10-19 year olds. In France the legal driving age is 18, but if accompanied by an adult you can drive from the age of 16. It might be that these less experienced drivers more often end up in severe accidents. Young people tend to be more reckless and have a certain feeling of immortality #YOLO.
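The idea of a relative frequency is simple: within each group, divide the count of each severity by the size of the group, so that the shares in every group sum to 1. A minimal sketch, with made-up group names and counts:

```python
from collections import Counter, defaultdict

def relative_frequencies(pairs):
    """Turn (group, severity) pairs into {group: {severity: share within group}}."""
    by_group = defaultdict(Counter)
    for group, severity in pairs:
        by_group[group][severity] += 1
    return {
        group: {sev: n / sum(counter.values()) for sev, n in counter.items()}
        for group, counter in by_group.items()
    }

# Toy data (hypothetical counts, not the French figures).
toy_pairs = (
    [("pedestrian", "serious")] * 2 + [("pedestrian", "light")] * 2
    + [("driver", "serious")] * 1 + [("driver", "unharmed")] * 3
)
freqs = relative_frequencies(toy_pairs)
```

Here `freqs["pedestrian"]["serious"]` is 0.5 and `freqs["driver"]["serious"]` is 0.25, even though the two groups have the same size in the toy data; with groups of very different sizes, this normalisation is what makes them comparable.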
Does your role in the traffic matter?
The short answer is yes. Pedestrians are the most exposed, with a relative frequency of 0.46 of getting in a serious accident (i.e. killed or hospitalized wounded). Skateboarders and scooter riders are also quite exposed, with a relative frequency of 0.41. And contrary to popular belief, it is actually safest to be a driver.
Does it matter if you use safety equipment or not?
Intuitively and based on knowledge from various road safety campaigns emphasising the importance of the use of safety equipment, the answer would be yes. And not surprisingly this is also the message here: not using safety equipment will get you killed.
The external factors are things that are completely out of our control, such as the weather conditions. Here we will investigate whether you should consider looking twice out the window before leaving your house and facing the dangers of the roads.
How important is the lighting?
Intuitively, lighting sounds pretty important when it comes to avoiding pedestrians while driving your car at night. But you have headlights, right? So does it really matter if the streets are lit as well? Let's find out.
As it turns out, street lighting might actually be quite important. Accidents occurring at night on roads with no public lighting tend to be more severe than under other light conditions.
... And what about the weather conditions?
Accidents occurring under rare weather conditions such as fog or storms tend to be more severe, while more common weather conditions like precipitation (at least in France) do not have the same negative effect on the severity.
If you are interested in exploring the relative frequency of severity for the remaining features click here. If the page does not load, try to reload. Can you find one (or more) that goes against your initial intuition?
So, what have we actually learned so far? Some things were as expected, whereas others were not.
Findings that are intuitive
Surprising findings
We hope you have enjoyed this journey so far. We are not done, the best is yet to come. In the previous section we have gotten to know the data and found a lot of insights about traffic accidents in France. The next part is going to be about what impacts the severity of an accident - not based on simple descriptive statistics but using machine learning. The idea is to train a model that is able to predict the severity of an accident based on a set of features and then afterwards investigate which features are of importance. Hopefully, this will tell us more about how future accidents can be prevented.
To investigate what impacts the severity of an accident, a classification method is used. The idea is to predict which one of the four severity types applies to a given person in an accident, based on information about the person and the environment of the accident. A proven, well-performing model, XGBoost, is used for this purpose. Why XGBoost, you might ask? Well...
Should we get started?
Curious about how the model has been trained? First, the data set has been balanced, meaning that the number of observations for each of the four severity types is the same. Then the data set is split into a training and a test set, where 75% of the data is contained in the training set.
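To make the two preparation steps concrete, here is a rough sketch in plain Python. It assumes the balancing is done by undersampling every class down to the size of the smallest one, which is only one of several ways to balance a data set and may differ from what the explainer notebook does. The function names, the seed and the toy rows are hypothetical.

```python
import random
from collections import Counter, defaultdict

def balance_by_undersampling(rows, label_of, seed=0):
    """Keep equally many rows per class by sampling down to the smallest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[label_of(row)].append(row)
    n = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, n))
    rng.shuffle(balanced)  # mix the classes before splitting
    return balanced

def split_train_test(rows, train_share=0.75):
    """Put the first 75% of the (already shuffled) rows in the training set."""
    cut = int(len(rows) * train_share)
    return rows[:cut], rows[cut:]

# Toy rows (hypothetical): (row id, severity) with unequal class sizes.
sizes = {"unscathed": 20, "light injury": 12, "hospitalized": 10, "killed": 8}
toy_rows = [(f"{label}-{i}", label) for label, n in sizes.items() for i in range(n)]

balanced = balance_by_undersampling(toy_rows, label_of=lambda row: row[1])
train, test = split_train_test(balanced)
class_sizes = Counter(label for _, label in balanced)
```

Note that this simple split is not stratified, so the four classes are only approximately balanced within the training and test sets.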
When solving a classification problem (here: predicting the severity of an accident), it is important to evaluate the performance of the model. How often are the predictions actually correct? One way to do this is to look at the predicted values vs. the true values, which is exactly what you see below in the Confusion matrix.
The accuracy of the XGBoost classifier is 47.92%
As the Confusion matrix is known to be confusing - here is an explanation of how to interpret what you see.
On the x-axis and the y-axis you see the 4 severity types (classes). On the x-axis these categories represent the predictions made by the classifier; on the y-axis the categories represent the true categories. Within the matrix a lot of numbers are seen, telling how the classifier's predictions relate to the true values. Ideally, all squares except for the ones on the diagonal should contain the number 0, meaning that the classifier was correct 100% of the time. This is unfortunately not the case with our classifier. As an example, it can be seen in the matrix that 414 people are predicted to be killed, but were actually unharmed in the accidents.
So how good is the classifier actually? Based on the numbers in the confusion matrix it is seen that the classifier is correct about 48% of the time.
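The accuracy can be read directly off a confusion matrix: it is the sum of the diagonal (correct predictions) divided by the sum of all cells. The sketch below uses a made-up 4x4 matrix with rows as true classes and columns as predicted classes; the numbers and the class order are only for illustration and are not the ones from our classifier.

```python
def accuracy_from_confusion(matrix):
    """Share of correct predictions: diagonal sum divided by the total count."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

# Toy matrix (hypothetical). Assumed class order on both axes:
# unscathed, light injury, hospitalized wounded, killed.
toy_matrix = [
    [50, 20, 20, 10],  # true: unscathed
    [25, 45, 20, 10],  # true: light injury
    [15, 20, 50, 15],  # true: hospitalized wounded
    [10, 15, 25, 50],  # true: killed
]
toy_accuracy = accuracy_from_confusion(toy_matrix)

# Cell [0][3]: truly unscathed but predicted as killed.
unscathed_predicted_killed = toy_matrix[0][3]
```

For comparison, a classifier guessing uniformly at random among four balanced classes would be right about 25% of the time.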
Wondering what makes the classifier predict an actually unharmed person to be killed in an accident? Next up is an analysis of feature importance. Which factors are important in general when making the predictions? And which features are important when looking at each of the severity types?
What do you expect to have an impact on the severity of the victims' injuries? We've come across different factors in the previous section. Let's see if the classifier agrees with these.
It is seen on the feature importance graph that many different types of features are important to the classifier: road type, purpose, use of safety equipment, collision type and whether you are a pedestrian. The road type especially stands out as very important to the classifier.
Remember that a plot like this does not tell us anything about what type of severity is influenced by which feature - it only gives a general picture. Wouldn't it be interesting if we can tell which features have an impact on whether a victim is killed or unharmed? Luckily with this classifier we can. Let's go.
Here is an overview of feature importance for killed victims along with how the feature values impact the model's belief in the victim being killed. For clarification, let's look at an example. For Departmental roads there is a clear split between blue and red observations. The color of an observation tells whether the value was high or low. In this case a high value means the accident was on a departmental road and a low value means that it wasn't on a departmental road. On the x-axis the SHAP value is shown. This value tells whether the different features have a positive or negative impact on the model output, which is the severity type "Killed" in this example. So what does the feature Departmental road tell us? As the high values have negative SHAP values, it means that the model believes driving on departmental roads will lower the chances of a fatal outcome.
Surprised about this? Didn't we just see, in the section about how badly one would be injured in an accident, that departmental roads were where most people were killed? Remember that the plot above reflects what the model thinks is important, and one explanation for this rather unexpected result could be that the classifier is not that good: 52% of the time the model misclassifies. Thinking that accidents on departmental roads have no impact on whether victims are killed might be a reason for the poor performance.
Wondering what is important for the other severity types? Let's take a look.
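If the sign of a SHAP value still feels abstract, it may help to compute Shapley values by hand for a tiny model. The sketch below uses the textbook definition (average the marginal contribution of each feature over all orderings, relative to a baseline observation). The scoring function, its coefficients and the feature names are invented for illustration; libraries for tree models estimate these values with much faster algorithms, so treat this purely as a picture of the concept.

```python
from itertools import permutations

def shapley_values(predict, x, baseline):
    """Exact Shapley values for one observation, relative to a baseline observation."""
    n = len(x)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous = predict(current)
        for i in order:
            current[i] = x[i]           # switch feature i from baseline to its real value
            value = predict(current)
            totals[i] += value - previous  # marginal contribution of feature i
            previous = value
    return [t / len(orderings) for t in totals]

def toy_killed_score(features):
    """Hypothetical model output for 'Killed' (coefficients are made up)."""
    departmental_road, no_safety_equipment = features
    return 0.1 - 0.05 * departmental_road + 0.2 * no_safety_equipment

phi = shapley_values(toy_killed_score, x=[1, 1], baseline=[0, 0])
```

In this toy model the first value is negative (about -0.05), so being on a departmental road pushes the "Killed" score down, while the second is positive (about 0.2). The values also add up to the difference between the score for the observation and the score for the baseline.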
Let's look at one severity type at a time.
For unscathed victims the most important feature is the trip purpose "Professional use". Victims with this purpose type will have a higher chance of being classified as unscathed by the model. However, the third-most important feature is whether the victim was a pedestrian or not. We see that if "pedestrian" has a high value - meaning that the victim was indeed a pedestrian - the SHAP values are negative. This means that pedestrians will have a higher chance of not being classified as unharmed. Both of these agree with the initial analysis of the data performed in the previous sections.
Light injuries have communal roads as the most important feature. If the accident happened on a communal road, the negative impact means the model believes the victim is less likely to get a light injury. As seen on the plot above, the second-most and third-most important features are latitude and longitude. As the high and low feature values have both high and low SHAP values, it is impossible to draw a clear conclusion on how the classifier uses these features. The fourth-most important feature, gender, shows that if the victim is male, the model believes the victim is more likely to get a light injury.
For victims being Hospitalized wounded, the feature departmental roads is once again important. If the accident happened on a departmental road, the victim is more likely to be hospitalized according to the model. The same goes for victims driving a light vehicle (e.g. motorcycle, quad car, scooter). Driving one of those exposes the victims to a higher risk of a serious injury.
Let's look at the predictions of killed victims again. Surprisingly, it seems that not using safety equipment does not mean that the victim is more likely to be killed in an accident, which is the exact opposite of what we saw previously. Is this just another sign of the classifier not being too smart when it comes to classifying killed victims? It might very well be.
We have considered a lot of different angles on the issue of preventing serious and potentially fatal accidents. Now it is time to ask ourselves "What have we learned today?". The initial analysis showed some intuitive trends and patterns and some which were rather surprising. This answered our questions on what, how, for whom and when accidents happen. Digging a bit deeper, we came closer to understanding why accidents happen, at least to some degree. Here, we learned that in general features describing road type, travel purpose, safety, collision type and whether or not you are a pedestrian had a greater impact than others. Although this tells us these features are important, it does not tell how they affect the outcome of an accident. By inspecting the impact of each feature on each of the severity types individually, it was revealed that victims are more likely to be unharmed if they are in a traffic accident while having a professional purpose for being in traffic, and less so if they are not using safety equipment or are a pedestrian. The model seems a bit more inconsistent when considering severe accidents. For example, it believes driving on departmental roads, being male and not using safety equipment lower your chances of being killed in an accident. This goes against the initial analysis as well as our intuition.
The most important lesson the initial data analysis has taught us is that accidents happen all the time and that everyone is at risk, even more so the soft road users, who are the most vulnerable. It is therefore important to be vigilant at all times in traffic, and not only when the conditions are troublesome. We hope you have enjoyed this journey with us!
Still curious about this project and the technical details? Please explore our explainer notebook here. To get the full experience please run the code to generate all the plots.